Abstract

We propose two operations to prevent sharing in Haskell that 
do not require modifying the data generating code, demon- 
strate their use and usefulness, and compare them to other 
approaches to preventing sharing. Our claims are supported 
by a formal semantics and a prototype implementation. 

Categories and Subject Descriptors D.1.1 [Programming
Techniques]: Applicative (Functional) Programming; D.3.3 
[Programming Languages]: Language Constructs and Features — 
Data types and structures; E.2 [Data storage representation] 

Keywords space leak, lazy evaluation, sharing, functional 
programming, natural semantics 

1. Introduction 

Thanks to the immutable nature of data in a pure functional 
programming language such as Haskell, there are many possibilities for
sharing, i.e. one object in memory can be used in
multiple places in the program. In general, this is a good 
thing, as it can save both execution time (by not calculating 
the data again) and memory space (by not copying the data). 

But there are cases where sharing can hurt, and sometimes 
hurt badly. A famous example (HQS) is the following func- 
tion: 

let l = [1..100000000]
    f :: [Int] -> Int
    f xs = last xs + head xs
in f l

This program is space-leaky and will quickly run out of mem- 
ory. If we substitute the term for xs in the body of f and eval- 
uate that expression, it runs quickly and in constant memory. 
We have avoided the sharing of xs between the calls to last 
and head and the list elements can be garbage collected as 
soon as they have been consumed by last. This came at the 
expense of evaluating the list twice, which is fine, as the list 
is large but cheap to calculate. 
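
The difference between the two variants can be sketched as follows. This is only an illustration: the names fShared and fUnshared are hypothetical, and whether the second variant really runs in constant memory depends on the compiler not re-introducing sharing (for example through common subexpression elimination).

```haskell
-- Shares xs between last and head: the whole list stays live
-- until last has finished traversing it.
fShared :: [Int] -> Int
fShared xs = last xs + head xs

-- Hypothetical unshared variant: the list expression is written out
-- twice, so the cells consumed by last can be collected immediately.
-- Assumption: no optimization pass merges the two enumerations again.
fUnshared :: Int -> Int
fUnshared n = last [1 .. n] + head [1 .. n]
```

Both functions compute the same value; they differ only in how long the list cells remain reachable.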

But this source transformation, as well as other source
transformations to avoid sharing (see Sections 3.3 and 3.4),
is not always possible or desirable, e.g. when the parameter 
passed to f comes from library code not under the control 



of the programmer. Therefore, we propose a new primitive 
operation dup which copies a (possibly unevaluated) value 
on the heap. 



data Box a = Box a
dup :: a -> Box a






Its value semantics are that of (\x -> Box x); the wrapping
in Box just serves the purpose of controlling the exact point 
of execution of dup by case-analyzing the Box. Using dup 
allows us to modify in the above example only the code of 
f to prevent sharing and achieve constant memory usage: 

let f xs = case dup xs of
             Box xs' -> last xs' + head xs
    l = [1..100000000]
in f l

In Section 3 we demonstrate the use of dup and other approaches
on the more elaborate example introduced in Section 2,
taking on the programmer's point of view.

A sharp-witted reader with knowledge of a typical implementation
of a Haskell runtime might already have noticed
that just copying the object on the heap representing
the parameter xs might not be enough: If, for example, the
first cons-cell of xs is already evaluated, then dup xs will copy
that cell, but the thunk representing the tail of the list will 
still be shared between xs' and xs, and f will again devour 
memory. Such things may occur without the programmer's 
knowledge, e.g. during a compiler optimization pass. 

To that end, we propose a variant of dup, called deepDup, 
which effectively copies the complete heap referenced by its 
argument. This happens - as one would expect for anything 
related to Haskell - lazily: The objects referenced by the pa- 
rameter are copied if and when they are needed. In other 
words: After having evaluated a function which only works 
on deepDup'ed copies of its parameters, nothing this evalu- 
ation has created on the heap is referenced anymore, unless 
it is referenced by the function's return value (this is formalized
in Theorem 2).

Our specific contributions are: 

• We introduce primitives that give the programmer the 
possibility to explicitly prevent sharing. 

• In contrast to approaches based on source transforma- 
tions, using dup and deepDup does not require changes 
to the generating code. 

• We provide precise semantics in the context of Launchbury's
natural semantics for lazy evaluation (Section 4)
and prove that the recursive variant deepDup is effective.

• We show the feasibility of our approach using a proof-of-concept
implementation targeting code compiled by an
unmodified GHC (Section 5).
The problem specification

type S = ...
init :: S
succs :: S -> [S]
value :: S -> Integer

The search tree code

data Tree = Node S [Tree]

fstChild :: Tree -> Tree
fstChild (Node _ (x:xs)) = x

tree :: S -> Tree
tree s = Node s (map tree (succs s))

solve :: Tree -> [S]
solve (Node n ts) = n : solve picked
  where
    rated = [ (t, rate depth t) | t <- ts ]
    picked = fst (maximumBy (comparing snd) rated)

depth = ...

rate :: Int -> Tree -> Integer
rate 0 (Node s _) = value s
rate d (Node _ ts) = maximum (map (rate (d - 1)) ts)

main = do
  let t = tree init
  print $ solve t !! 10000
  doSomethingElseWith t

Figure 1. The running example



2. The running example 

For the remainder of the paper, we will use one running 
example to demonstrate and discuss the use of dup. The task 
at hand, inspired by the minimax algorithm that searches for 
an optimal strategy in a two-player turn-based game, is to 
find a path through a (possibly infinite) tree that maximizes 
some valuation of the nodes. So abstractly, we have a type S 
of states, a valuation function value, an initial state init and 
for every state s, a list of successor states succs s. For the sake
of simplicity of the presentation, we assume this succs s to be
always non-empty (see also Figure 1).

Based on these functions, we define a search tree and 
a solver. The solver picks the successor with the highest 
rating, whereas the rating is the highest value of nodes at 
a configurable depth. 

Assume a constant number of successors b, b > 0, and 
that the value of depth is d. Consider what happens when 
we want to calculate the first 10 000 elements of the solution: 
The rate function will evaluate lots of nodes that will not be 
picked for the solution. But as they are still referenced by 
the tree t, the garbage collector cannot get rid of them. So in 
addition to the 10 000 interesting nodes, roughly 10 000 · (b - 1) · b^d
nodes are evaluated that the programmer knows
are not required to be kept around. The first row of Figure 3
depicts the heap during this evaluation, with d = 1 and
b = 2.

More concretely with d = 4, b = 4, type S = Word32 
and a very cheap succs and value functions, this program 
requires 4 189 MB of system memory (as reported by the 
GHC runtime as "total memory in use" when passed the 



-s option) and runs in 24.15 seconds.¹ Sharing is indeed the
problem here: If we remove the last line of main, the program 
runs in 2 MB of memory and takes 6.70 seconds. 
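
To experiment with the running example, the elided parts of Figure 1 have to be filled in somehow. The following is a minimal sketch under stated assumptions: the concrete succs and value functions, the start state, and depth = 2 are made up for illustration and are not the definitions used for the measurements; init is renamed to initS to avoid the clash with the Prelude function of the same name.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)
import Data.Word (Word32)

type S = Word32

-- Assumption: arbitrary start state.
initS :: S
initS = 1

-- Assumption: a cheap successor function with branching factor b = 4.
succs :: S -> [S]
succs s = [4 * s + k | k <- [0 .. 3]]

-- Assumption: an arbitrary cheap valuation.
value :: S -> Integer
value s = toInteger (s `mod` 7)

data Tree = Node S [Tree]

tree :: S -> Tree
tree s = Node s (map tree (succs s))

-- Assumption: a small look-ahead depth.
depth :: Int
depth = 2

rate :: Int -> Tree -> Integer
rate 0 (Node s _) = value s
rate d (Node _ ts) = maximum (map (rate (d - 1)) ts)

solve :: Tree -> [S]
solve (Node n ts) = n : solve picked
  where
    rated = [(t, rate depth t) | t <- ts]
    picked = fst (maximumBy (comparing snd) rated)
```

Evaluating take 6 (solve (tree initS)) yields a path in which every state is a successor of the one before it; keeping another reference to the tree alive is what causes the space behaviour discussed above.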

3. Unsharing the example 

We want to improve the space performance of the program in 
the example and thus, due to the saved work in the garbage 
collector, also the runtime performance. In the following, we 
use dup, first wrapping the argument of solve, then the argument
of rate, and deepDup. We also try two variants that work
without new primitives, but require refactoring the generating
code. The statistics are collected in Figure 2, where all six
strategies are applied to

• an otherwise unreferenced tree, i.e. the example code 
without the last line of the main function, 

• a shared, unevaluated tree as shown in Figure 1,

• a shared, unevaluated tree wrapped in another thunk, by 
passing (fstChild t) to solve, 

• a shared tree that has been partly evaluated by forcing
seq (fstChild t) before passing t to the solver,

• a shared tree that has been fully evaluated by the unmod- 
ified solver before, 

• a shared, unevaluated tree that is processed twice by the 
(possibly modified) solver. 

In the two variants based on refactoring, the data type used 
for the tree does not allow for partial or full evaluation, so 
these runs are omitted. 

3.1 Using dup 

We now modify the example to use our new primitives. 
There are a few choices in doing so, with different trade-offs. 
One candidate for dup'ing is the function solve: We know that 
the parameter t to solve is an unevaluated expression, and de- 
coupling that from the t that we pass to doSomethingElseWith 
will allow the garbage collector to clean up the tree as solve 
proceeds to process it (Figure 3, second row). So we wrap
solve in solveDup and use that in main. 

solveDup t = case dup t of Box t' -> solve t'

And indeed, we have almost achieved the performance of the 
original program without sharing: 3 MB and 6.74 seconds. 

Another candidate for dup'ing is the function rate: As this 
is the function whose return value is taken into account when 
deciding whether to pick the argument or not, we know 
that in most cases, its argument will not be used any more. 
Therefore, by creating a wrapper rateDup that duplicates the 
argument, and using that in solve, we allow for the argument 
and all its children to be garbage collected once rate has 
finished. 

rateDup d t = case dup t of Box t' -> rate d t'

Both the runtime and the memory footprint of the pro- 
gram are greatly reduced compared to the original program: 
It uses 5 MB of memory and takes 2.34 seconds to finish. It 
is surprising that this even surpasses the speed of the orig- 
inal program without sharing. The reason is that with rate 
wrapped in dup, the first child of the node under inspection 

1 All statistics are obtained on a machine with 2 GHz and
sufficient (32 GB) RAM. The complete code used to generate
these statistics is available in the ghc-dup repository at
http://darcs.nomeata.de/ghc-dup
Figure 2. Time and space performance for b = 4 and d = 4

Figure 3. The heap during original and dup'ed evaluation with b = 2 and d = 1

Figure 4. Comparing solveDup and solveDeepDup applied to a partly evaluated tree with b = 2 and d = 1

Figure 5. Time and space performance for b = 4 and d = 4 using an expensive succs function.



of solve can be freed already when its next child is evaluated
by rate (Figure 3, last row, second-to-last column), so
the copying garbage collector needs to do even less work. 

3.2 Using deepDup 

Using dup is a fragile business and requires the programmer 
to have a very good idea about what is happening at runtime. 
It will fail, for example, in two common situations: If the call
to solveDup in main in Figure 1 did not just pass the tree t
but rather an expression referencing t, e.g.

print $ solveDup (fstChild t) !! 1000

then dup will only copy this unevaluated expression, but 
both copies will reference the same unevaluated expression 
for t, and we are back at the original performance (4 188 MB, 
24.32 seconds). 

The same effect occurs if the tree is already partly evaluated.
This may even be caused by a compiler transformation,
e.g. the wrapper/worker transformation, assuming that
doSomethingElseWith is strict in its argument [10]. Then, the
parameter t is the Node constructor referencing other nodes
or unevaluated trees, and copying the constructor does not
help to prevent sharing the referenced data, as shown in the
first row of Figure 4.

This is where deepDup comes in: Intuitively, deepDup
takes a complete and private copy of the entire heap reach- 
able from its argument, hence preventing any unwanted eval- 
uation outside this copy. In fact this is done lazily: It will just 
copy the object specified by its parameter, and change all ref- 
erences therein so that before they are evaluated, deepDup 
copies them. 

So by wrapping solve in a call to deepDup: 

solveDeepDup t = case deepDup t of Box t' -> solve t'

we achieve the performance of a successful run with dup 
(2 MB and 6.60 seconds), but also in the cases where t has al- 
ready been partly evaluated or is wrapped in another unevaluated
expression. The second row of Figure 4 shows deepDup
at work.

Using deepDup is therefore more reliable and easier to 
handle: The programmer need not have an exact idea of the 
evaluation state of the arguments when deepDup is called. 
And the recursive copying is surprisingly cheap: Even when 
the tree is already fully evaluated, e.g. by an earlier call 
to solve t !! 10000, the runtime stays the same within the 
precision of the benchmark. 

3.3 The unit type argument pattern 

The problem at hand is, of course, not new, and Haskell 
programmers have solved it one way or the other before, by 
rewriting the code to allow more control over sharing. 



A common approach is to replace values that you do not
want to be shared by functions, e.g. by turning a bound
expression let x = e into a lambda expression let x = \() -> e.
At every point in the program where e is required, one can
get the value of it using x (); there will be no sharing between
different calls to x ().

One needs to be careful, though, as some compiler opti- 
mizations can introduce unwanted sharing again. The code 

xs :: () -> [Int] 

xs () = [1.. 10000000] 

main = do 

print (last (xs ())) 
print (head (xs ())) 

works as expected without optimization. Passing -O to GHC
results in sharing again, as a result of the full laziness
transformation. In fact, in a discussion of this example on the GHC
bug tracker [12], Claus Reinke suggests an operation like dup
to solve this.²

Applying this pattern to our problem, and aiming for a 
tree with unshareable subtrees, we can define the following 
types: 

data UTree' = UNode S [UTree]
type UTree = () -> UTree'

The required changes to the functions on trees are mechanical 
and guided by the type checker. The resulting code, when 
not hit by some optimization-induced re-sharing, shows very 
good time and space complexity. If sharing is desired at some 
points of the program, those parts will have to work with the 
regular Tree type, possibly leading to a duplication of code. 
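
The mechanical rewrite can be sketched as follows. This is an illustration only: utree, urate and usolve are hypothetical names, the state type and the succs and value functions are toy stand-ins, and the look-ahead depth is passed as an explicit argument to keep the example self-contained.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Assumptions: toy state type, successor function and valuation.
type S = Int

succs :: S -> [S]
succs s = [2 * s, 2 * s + 1]

value :: S -> Integer
value = toInteger

data UTree' = UNode S [UTree]
type UTree = () -> UTree'

-- Every subtree is a function, so each application to () rebuilds it.
utree :: S -> UTree
utree s = \() -> UNode s (map utree (succs s))

-- Forcing a subtree now means applying it to the unit value.
urate :: Int -> UTree -> Integer
urate d t = case t () of
  UNode s ts
    | d == 0    -> value s
    | otherwise -> maximum (map (urate (d - 1)) ts)

usolve :: Int -> UTree -> [S]
usolve depth t = case t () of
  UNode n ts -> n : usolve depth (fst (maximumBy (comparing snd) rated))
    where
      rated = [(c, urate depth c) | c <- ts]
```

With these toy definitions, take 4 (usolve 1 (utree 1)) evaluates to [1,3,7,15]. As noted above, the full laziness transformation may float the lambda bodies out and restore sharing, so the space behaviour of such code is not guaranteed under optimization.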

3.4 Church encoding 

An alternative is to restructure the program so that the value 
that must not be shared is not represented using data constructors
but rather as a higher-order function [2, 5]. This
transformation is known as the Church encoding of a data 
type, or a variant thereof. For the algebraic tree data type in 
our running example, we would obtain the following type 
and conversion functions: 

type CTree = forall a. (S -> [a] -> a) -> a

toCTree :: Tree -> CTree
toCTree (Node s ts) f = f s $ map (\t -> toCTree t f) ts



2 If, however, the type signature of xs is not given, then no unwanted
sharing happens even with -O. The inferred most general type of xs is
polymorphic with type class constraints. This implies that additional 
parameters are being passed under the hood and they successfully 
prevent sharing. 
fromCTree :: CTree -> Tree
fromCTree ct = ct Node

A Church-encoded tree corresponding to the value tree s
can be nicely created with the following code:

ctree :: S -> CTree
ctree s f = f s $ map (\s' -> ctree s' f) (succs s)

Unfortunately, adapting solve to this type is a non-trivial
task, as the two recursions happening therein (solve and rate)
need to be folded into one pass:

csolve :: CTree -> [S]
csolve t = fst (t csolve')
  where
    csolve' :: S -> [([S], Int -> Integer)] -> ([S], Int -> Integer)
    csolve' n rc =
      ( n : fst (maximumBy (comparing (($ depth) . snd)) rc)
      , \d -> if d == 0 then value n
                        else maximum (map (($ d - 1) . snd) rc))

This additional complexity might make this approach impractical
in larger settings. Note, though, that applying this
pattern to the list data type turns a list into its right fold and
can enable deforestation [J].

3.5 Comparison and interpretation

As we can see from the statistics in Figure 2, the unit type
argument pattern is the clear winner in both runtime and space
performance. It is ahead of rateDup for the same reason that
made rateDup faster than solveDup: Now even the subtrees in
recursive calls of rate are freed immediately. Unfortunately,
it requires a thorough refactoring of both the data generating
and data consuming code; all combinators working on the
data type need to be carefully rewritten to preserve the
non-sharing behavior of the lifted data type. Also, the full laziness
transformation can break the pattern, making it slightly fragile.

The Church encoding pattern shows good and predictable
memory performance, but exhibits slightly worse runtime
behavior. In the cases where it is ahead of other approaches,
it wins only due to the garbage collector overhead induced
by unprevented sharing. Like the previous pattern, it requires
extensive refactoring.

Our primitives come with very small overhead when applied
to data that is actually unshared, as we show in the first
column. In fact, careful use of dup can improve performance
noticeably even if only small pieces of data can be un-shared
and thus freed quickly. While dup is subtle to use, deepDup
is robust and its effect is more precisely defined, as shown in
the next section.

Obviously, avoiding sharing is a bad idea when the result
is expensive to create. Figure 5 runs the same benchmark
with an expensive (but otherwise equal) succs function. If the
tree needs to be processed twice, then throwing the result
away after the first run (as done in the last column) results
in a serious loss of run-time performance. Also, for the same
reason that the rateDup and unit lifting variants were faster
before, they now slow down the program, as parts of the tree
are evaluated twice.

On the other hand, if the memory footprint becomes
larger than the available memory, being able to run the program
slowly is still better than not being able to run it at all,
so even in this case there can be uses for dup and deepDup.

4. A natural semantics

To substantiate our claims about the usefulness of dup and 
especially deepDup, we give them a precise meaning within 
Launchbury's natural semantics for lazy evaluation [6] and
prove that all memory allocated by a function whose argu- 
ments are wrapped with deepDup can be freed after the func- 
tion has been completely evaluated. 

We extend Launchbury's semantics for normalized lambda 
calculus with our two primitives: 

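\begin{align*}
x, y &\in \mathit{Var} \\
e &\in \mathit{Exp} ::= \lambda x.\, e \mid e\, x \mid x \mid
    \mathbf{let}\ x_1 = e_1, \dots, x_n = e_n\ \mathbf{in}\ e \mid
    \mathsf{dup}\ x \mid \mathsf{deepDup}\ x \\
\Gamma, \Delta, \Theta &\in \mathit{Heap} = \mathit{Var} \rightharpoonup \mathit{Exp} \\
z &\in \mathit{Val} ::= \lambda x.\, e
\end{align*}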

His lambda terms are normalized, i.e. all bound variables are 
distinct and all applications are applications of an expression 
to a variable. 

The set of free variables of an expression $e$ is $\mathrm{fv}(e)$. Similarly,
the set of unguarded free variables $\mathrm{ufv}(e)$ of an expression
$e$ is inductively defined just like $\mathrm{fv}(e)$ with the exception
that $\mathrm{ufv}(\mathsf{deepDup}\ x) = \emptyset$. A value $\hat{z}$ is $z$ with all bound
variables renamed to completely fresh variables.

To avoid having to introduce constructors and case ex- 
pressions as well we assume dup and deepDup to return their 
result without the wrapping in Box. This captures the seman- 
tics of the Haskell expression 

(\x -> let Box y = dup x in y) :: a -> a.

In addition to the unmodified reduction rules Lam, App,
Var and Let, we add the two rules Dup and Deep in Figure 6.
The use of $\mathrm{ufv}(e)$ instead of $\mathrm{fv}(e)$ in the rule Deep
is required to avoid a livelock if deepDup x is evaluated while
x is itself bound to deepDup y.

In the following, every heap/term pair $\Gamma : e$ is assumed to
be distinctly named, i.e. every binding occurring in $\Gamma$ and in
$e$ binds a distinct variable; this property is preserved by the
reduction rules.
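
To make the heap manipulation performed by the two new rules concrete, here is a small executable model. It is only a sketch under stated assumptions: it models the heaps in the premises of Dup and Deep (not the full evaluation relation), all Haskell names (Exp, Heap, ufv, rename, fresh, dupHeap, deepDupHeap) and the fresh-name scheme are made up for illustration, and renaming relies on terms being distinctly named.

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

-- Expressions of the normalized lambda calculus with the two primitives.
data Exp
  = Lam String Exp
  | App Exp String
  | Var String
  | Let [(String, Exp)] Exp
  | Dup String
  | DeepDup String
  deriving (Eq, Show)

type Heap = M.Map String Exp

-- Unguarded free variables: like fv, but deepDup guards its argument.
ufv :: Exp -> S.Set String
ufv (Lam x e)   = S.delete x (ufv e)
ufv (App e x)   = S.insert x (ufv e)
ufv (Var x)     = S.singleton x
ufv (Let bs e)  = S.unions (ufv e : map (ufv . snd) bs)
                    `S.difference` S.fromList (map fst bs)
ufv (Dup x)     = S.singleton x
ufv (DeepDup _) = S.empty

-- Rename free variables. Assumption: distinctly named terms, so no
-- capture avoidance is needed in this sketch.
rename :: M.Map String String -> Exp -> Exp
rename m = go
  where
    r x = M.findWithDefault x x m
    go (Lam x e)   = Lam x (go e)
    go (App e x)   = App (go e) (r x)
    go (Var x)     = Var (r x)
    go (Let bs e)  = Let [(x, go b) | (x, b) <- bs] (go e)
    go (Dup x)     = Dup (r x)
    go (DeepDup x) = DeepDup (r x)

-- A name not yet bound in the heap (hypothetical naming scheme).
fresh :: Heap -> String -> String
fresh h x =
  head [n | i <- [0 :: Int ..], let n = x ++ "_" ++ show i, M.notMember n h]

-- Premise heap of rule Dup: bind a fresh x' to the same expression as x.
dupHeap :: String -> Heap -> Maybe (String, Heap)
dupHeap x h = do
  e <- M.lookup x h
  let x' = fresh h x
  pure (x', M.insert x' e h)

-- Premise heap of rule Deep: bind x' to a renamed copy of e and bind each
-- fresh y' to deepDup y, for every unguarded free variable y of e.
deepDupHeap :: String -> Heap -> Maybe (String, Heap)
deepDupHeap x h = do
  e <- M.lookup x h
  let x' = fresh h x
      step (hp, ren) y =
        let y' = fresh hp y
        in (M.insert y' (DeepDup y) hp, M.insert y y' ren)
      -- x' is reserved with a placeholder so that no y' can collide with it
      (h', ren) = foldl step (M.insert x' (Var x') h, M.empty) (S.toList (ufv e))
  pure (x', M.insert x' (rename ren e) h')
```

For a heap binding x to the application f y, deepDupHeap "x" adds a binding of the fresh variable to an application of two further fresh variables, each of which is bound to a deepDup of the original f and y. Because ufv of a deepDup expression is empty, applying deepDupHeap to one of these new bindings copies it without wrapping its argument again.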

Besides the natural semantics, Launchbury also defines a 
denotational semantics. He models values as a lifted function 
space, denoted Value, and environments 

\[ \rho \in \mathit{Env} = \mathit{Var} \to \mathit{Value} \]

as functions from variables into values. He writes $\rho \le \rho'$
if $\rho'$ extends $\rho$, i.e. they differ only for variables where $\rho$ is
bottom. The expression $\llbracket e \rrbracket_{\rho}$ is the value of the expression $e$
in the environment $\rho$.

The semantics of a heap $\Gamma$ is given by $\{\!\{\Gamma\}\!\}\rho$, which is the
environment $\rho$ updated by the values specified in the heap.
This is defined as a fixed point, as the heap may contain
recursive references:

\[
\{\!\{x_1 \mapsto e_1, \dots, x_n \mapsto e_n\}\!\}\rho
  = \mu\rho'.\ \rho \sqcup (x_1 \mapsto \llbracket e_1 \rrbracket_{\rho'}) \sqcup \dots \sqcup (x_n \mapsto \llbracket e_n \rrbracket_{\rho'})
\]

This definition makes sense on environments $\rho$ that are consistent
with $\Gamma$, i.e. if $\rho$ and $\Gamma$ bind the same variable, then they
are bound to values for which an upper bound exists.

Launchbury proves his natural semantics to be correct 
with respect to the denotational semantics. Naturally, we 
want to preserve this property. Our new primitives should 
be invisible to the denotational semantics, hence we extend 
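\[
\dfrac{}{\Gamma : \lambda x.\, e \Downarrow \Gamma : \lambda x.\, e}\ \textsc{Lam}
\qquad
\dfrac{\Gamma : e \Downarrow \Delta : \lambda y.\, e' \quad \Delta : e'[x/y] \Downarrow \Theta : z}
      {\Gamma : e\, x \Downarrow \Theta : z}\ \textsc{App}
\qquad
\dfrac{\Gamma : e \Downarrow \Delta : z}
      {\Gamma, x \mapsto e : x \Downarrow \Delta, x \mapsto z : \hat{z}}\ \textsc{Var}
\]
\[
\dfrac{\Gamma, x_1 \mapsto e_1, \dots, x_n \mapsto e_n : e \Downarrow \Delta : z}
      {\Gamma : \mathbf{let}\ x_1 = e_1, \dots, x_n = e_n\ \mathbf{in}\ e \Downarrow \Delta : z}\ \textsc{Let}
\qquad
\dfrac{\Gamma, x \mapsto e, x' \mapsto e : x' \Downarrow \Delta : z \quad x' \text{ fresh}}
      {\Gamma, x \mapsto e : \mathsf{dup}\ x \Downarrow \Delta : z}\ \textsc{Dup}
\]
\[
\dfrac{\begin{array}{c}
        \Gamma, x \mapsto e, x' \mapsto e[y'_1/y_1, \dots, y'_n/y_n],
        y'_1 \mapsto \mathsf{deepDup}\ y_1, \dots, y'_n \mapsto \mathsf{deepDup}\ y_n : x' \Downarrow \Delta : z \\
        \mathrm{ufv}(e) = \{y_1, \dots, y_n\} \quad y'_1, \dots, y'_n \text{ fresh}
       \end{array}}
      {\Gamma, x \mapsto e : \mathsf{deepDup}\ x \Downarrow \Delta : z}\ \textsc{Deep}
\]

Figure 6. Natural semantics extended for dup and deepDup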



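the semantics function as follows:

\begin{align*}
\llbracket \mathsf{dup}\ x \rrbracket_{\rho} &:= \llbracket x \rrbracket_{\rho} \\
\llbracket \mathsf{deepDup}\ x \rrbracket_{\rho} &:= \llbracket x \rrbracket_{\rho}
\end{align*}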

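Theorem 1 (Theorem 2 from [6]) If $\Gamma : e \Downarrow \Delta : z$, then for all
environments $\rho$,

\[
\llbracket e \rrbracket_{\{\!\{\Gamma\}\!\}\rho} = \llbracket z \rrbracket_{\{\!\{\Delta\}\!\}\rho}
\quad\text{and}\quad
\{\!\{\Gamma\}\!\}\rho \le \{\!\{\Delta\}\!\}\rho .
\]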

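PROOF. The proof in [6] is by induction on the derivation; we
only have to give it for the two new cases corresponding to
the rules Dup and Deep. We assume that the fresh variables
in the rules are chosen to be undefined in $\rho$.

Case: dup x

By induction, we know
(i) $\llbracket x' \rrbracket_{\{\!\{\Gamma, x \mapsto e, x' \mapsto e\}\!\}\rho} = \llbracket z \rrbracket_{\{\!\{\Delta\}\!\}\rho}$ and
(ii) $\{\!\{\Gamma, x \mapsto e, x' \mapsto e\}\!\}\rho \le \{\!\{\Delta\}\!\}\rho$.
For the first part, we have

\begin{align*}
\llbracket \mathsf{dup}\ x \rrbracket_{\{\!\{\Gamma, x \mapsto e\}\!\}\rho}
  &= \llbracket x \rrbracket_{\{\!\{\Gamma, x \mapsto e\}\!\}\rho} \\
  &= \llbracket e \rrbracket_{\{\!\{\Gamma, x \mapsto e\}\!\}\rho} \\
  &= \llbracket e \rrbracket_{\{\!\{\Gamma, x \mapsto e, x' \mapsto e\}\!\}\rho} && x' \text{ fresh} \\
  &= \llbracket x' \rrbracket_{\{\!\{\Gamma, x \mapsto e, x' \mapsto e\}\!\}\rho} \\
  &= \llbracket z \rrbracket_{\{\!\{\Delta\}\!\}\rho} && \text{by (i)}
\end{align*}

as desired.

The second part follows from (ii) and from $x'$ being fresh:
$\{\!\{\Gamma, x \mapsto e\}\!\}\rho \le \{\!\{\Gamma, x \mapsto e, x' \mapsto e\}\!\}\rho \le \{\!\{\Delta\}\!\}\rho$.

Case: deepDup x

Let $\Gamma'$ denote the heap in the assumption of the rule, i.e.
$\Gamma, x \mapsto e, x' \mapsto e[y'_1/y_1, \dots, y'_n/y_n], y'_1 \mapsto \mathsf{deepDup}\ y_1, \dots, y'_n \mapsto \mathsf{deepDup}\ y_n$.
By induction, we know
(i) $\llbracket x' \rrbracket_{\{\!\{\Gamma'\}\!\}\rho} = \llbracket z \rrbracket_{\{\!\{\Delta\}\!\}\rho}$ and
(ii) $\{\!\{\Gamma'\}\!\}\rho \le \{\!\{\Delta\}\!\}\rho$.

The newly introduced variables $y'_i$, $i = 1, \dots, n$, have the
same semantics as their original counterparts:

\[
\llbracket y'_i \rrbracket_{\{\!\{\Gamma'\}\!\}\rho}
 = \llbracket \mathsf{deepDup}\ y_i \rrbracket_{\{\!\{\Gamma'\}\!\}\rho}
 = \llbracket y_i \rrbracket_{\{\!\{\Gamma'\}\!\}\rho}
 = \llbracket y_i \rrbracket_{\{\!\{\Gamma, x \mapsto e\}\!\}\rho} .
\]

This implies (iii)
$\llbracket e[y'_1/y_1, \dots, y'_n/y_n] \rrbracket_{\{\!\{\Gamma'\}\!\}\rho} = \llbracket e \rrbracket_{\{\!\{\Gamma, x \mapsto e\}\!\}\rho}$.
Hence

\begin{align*}
\llbracket \mathsf{deepDup}\ x \rrbracket_{\{\!\{\Gamma, x \mapsto e\}\!\}\rho}
  &= \llbracket x \rrbracket_{\{\!\{\Gamma, x \mapsto e\}\!\}\rho} \\
  &= \llbracket e[y'_1/y_1, \dots, y'_n/y_n] \rrbracket_{\{\!\{\Gamma'\}\!\}\rho} && \text{by (iii)} \\
  &= \llbracket x' \rrbracket_{\{\!\{\Gamma'\}\!\}\rho} \\
  &= \llbracket z \rrbracket_{\{\!\{\Delta\}\!\}\rho} && \text{by (i)}
\end{align*}

and, by (ii),

\[
\{\!\{\Gamma, x \mapsto e\}\!\}\rho \le \{\!\{\Gamma'\}\!\}\rho \le \{\!\{\Delta\}\!\}\rho . \qquad\blacksquare
\]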

More interesting than the semantic correctness of our additional
rules is what properties of deepDup we can prove
with them. Following our intuition from the introduction, we
formulate the next theorem, where $\Gamma \subseteq \Delta$ means that $\Gamma$ and $\Delta$
agree on the domain of $\Gamma$ and only new variables are bound.

Theorem 2 Consider the expression

\[
e = \mathbf{let}\ x'_1 = \mathsf{deepDup}\ x_1, \dots, x'_n = \mathsf{deepDup}\ x_n\ \mathbf{in}\ e'
\]

with $\mathrm{fv}(e') \subseteq \{x'_1, \dots, x'_n\}$. If $\Gamma : e \Downarrow \Delta : z$ and $z$ is a closed
value (i.e. $\mathrm{fv}(z) = \emptyset$), then $\Gamma \subseteq \Delta$.

This implies that any value on the heap $\Delta$ that was created
during the evaluation of $e$ can be freed afterwards.

The theorem is an immediate consequence of statement
(a) of the following Lemma 3 with $\Gamma_0 = \Gamma$. We will need the
notion of the unguarded reachable set $\mathrm{ur}_\Gamma(e)$ of an expression
$e$ in a context $\Gamma$, which is mutually defined for all expressions
as the smallest sets which fulfill the equation

\[
\mathrm{ur}_\Gamma(e) = \mathrm{ufv}(e) \cup \bigcup_{x \in \mathrm{ufv}(e)} \mathrm{ur}_\Gamma(\Gamma\, x) .
\]

Note that $\mathrm{ufv}(e) \subseteq \mathrm{ufv}(e')$ implies $\mathrm{ur}_\Gamma(e) \subseteq \mathrm{ur}_\Gamma(e')$.

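Lemma 3 Let $\Gamma_0$ be a heap and $U = \mathrm{dom}\,\Gamma_0$ its domain. If
$\Gamma : e \Downarrow \Delta : z$, $\Gamma_0 \subseteq \Gamma$ and $U \cap \mathrm{ur}_\Gamma(e) = \emptyset$, then

(a) $\Gamma_0 \subseteq \Delta$,

(b) $U \cap \mathrm{ur}_\Delta(z) = \emptyset$ and

(c) $U \cap \mathrm{ur}_\Gamma(y) = \emptyset$ implies $U \cap \mathrm{ur}_\Delta(y) = \emptyset$ for $y \in \mathrm{dom}\,\Gamma$.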

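PROOF. The proof is by induction on the structure of the
derivation $\Gamma : e \Downarrow \Delta : z$.

Case: $\lambda x.\, e$
Immediate.

Case: $e\, x$

From $\mathrm{ur}_\Gamma(e\, x) = \mathrm{ur}_\Gamma(e) \cup \mathrm{ur}_\Gamma(x)$ and the assumption
$U \cap \mathrm{ur}_\Gamma(e\, x) = \emptyset$ we have $U \cap \mathrm{ur}_\Gamma(e) = \emptyset$ and $U \cap \mathrm{ur}_\Gamma(x) = \emptyset$.
From the first induction hypothesis we obtain (i) $\Gamma_0 \subseteq \Delta$, (ii)
$U \cap \mathrm{ur}_\Delta(\lambda y.\, e') = \emptyset$ and (iii) $U \cap \mathrm{ur}_\Delta(x) = \emptyset$.

As $\mathrm{ur}_\Delta(e'[x/y]) \subseteq \mathrm{ur}_\Delta(\lambda y.\, e') \cup \mathrm{ur}_\Delta(x)$, (ii) and (iii) imply
$U \cap \mathrm{ur}_\Delta(e'[x/y]) = \emptyset$. With (i) we obtain (a) $\Gamma_0 \subseteq \Theta$ and (b)
$U \cap \mathrm{ur}_\Theta(z) = \emptyset$ from the second induction hypothesis.

Statement (c) follows immediately from the induction
hypotheses.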
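Case: $x$

Removing a variable from a heap does not increase unguarded
reachable sets, so $\mathrm{ur}_\Gamma(e) \subseteq \mathrm{ur}_{\Gamma, x \mapsto e}(e) \subseteq \mathrm{ur}_{\Gamma, x \mapsto e}(x)$. From
$x \in \mathrm{ur}_{\Gamma, x \mapsto e}(x)$ and the assumption $U \cap \mathrm{ur}_{\Gamma, x \mapsto e}(x) = \emptyset$ we have
$x \notin U$, thus $\Gamma_0 \subseteq \Gamma$, and $U \cap \mathrm{ur}_\Gamma(e) = \emptyset$. From the induction
hypothesis we now obtain $\Gamma_0 \subseteq \Delta$ and $U \cap \mathrm{ur}_\Delta(z) = \emptyset$. As
$\Delta \subseteq (\Delta, x \mapsto z)$, $\mathrm{ufv}(\hat{z}) = \mathrm{ufv}(z)$ and $\mathrm{ur}_\Delta(z) = \mathrm{ur}_{\Delta, x \mapsto z}(z)$,
the statements (a) $\Gamma_0 \subseteq (\Delta, x \mapsto z)$ and (b) $U \cap \mathrm{ur}_{\Delta, x \mapsto z}(\hat{z}) = \emptyset$
follow.

Let $y \in \mathrm{dom}\,\Gamma_0$ with $U \cap \mathrm{ur}_{\Gamma, x \mapsto e}(y) = \emptyset$. As $\mathrm{ur}_\Gamma(y) \subseteq
\mathrm{ur}_{\Gamma, x \mapsto e}(y)$ we have $U \cap \mathrm{ur}_\Gamma(y) = \emptyset$ and hence $U \cap \mathrm{ur}_\Delta(y) = \emptyset$
from the induction hypothesis. This and (b) imply (c), as
$\mathrm{ur}_{\Delta, x \mapsto z}(y) \subseteq \mathrm{ur}_\Delta(y) \cup \mathrm{ur}_\Delta(z)$.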

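Case: $\mathbf{let}\ x_1 = e_1, \dots, x_n = e_n\ \mathbf{in}\ e$

For brevity, let $\Gamma' = \Gamma, x_1 \mapsto e_1, \dots, x_n \mapsto e_n$ and
$e_l = \mathbf{let}\ x_1 = e_1, \dots, x_n = e_n\ \mathbf{in}\ e$. Clearly $\Gamma_0 \subseteq \Gamma \subseteq \Gamma'$. Also, for each
$e^* \in \{e, e_1, \dots, e_n\}$ we have $\mathrm{ufv}(e^*) \subseteq \mathrm{ufv}(e_l) \cup \{x_1, \dots, x_n\}$.
This implies

\begin{align*}
\mathrm{ur}_{\Gamma'}(e)
 &= \mathrm{ufv}(e) \cup \bigcup_{x \in \mathrm{ufv}(e)} \mathrm{ur}_{\Gamma'}(\Gamma'\, x) \\
 &\subseteq \mathrm{ufv}(e_l) \cup \{x_1, \dots, x_n\}
    \cup \bigcup_{x \in \mathrm{ufv}(e_l)} \mathrm{ur}_{\Gamma'}(\Gamma'\, x)
    \cup \mathrm{ur}_{\Gamma'}(\Gamma'\, x_1) \cup \dots \cup \mathrm{ur}_{\Gamma'}(\Gamma'\, x_n) \\
 &= \mathrm{ufv}(e_l) \cup \{x_1, \dots, x_n\}
    \cup \bigcup_{x \in \mathrm{ufv}(e_l)} \mathrm{ur}_{\Gamma'}(\Gamma'\, x)
    \cup \mathrm{ur}_{\Gamma'}(e_1) \cup \dots \cup \mathrm{ur}_{\Gamma'}(e_n) \\
 &= \mathrm{ufv}(e_l) \cup \{x_1, \dots, x_n\}
    \cup \bigcup_{x \in \mathrm{ufv}(e_l)} \mathrm{ur}_{\Gamma'}(\Gamma'\, x) \\
 &= \mathrm{ur}_{\Gamma'}(e_l) \cup \{x_1, \dots, x_n\} \\
 &= \mathrm{ur}_{\Gamma}(e_l) \cup \{x_1, \dots, x_n\} .
\end{align*}

As all bound variables are distinct from variables in the
heap, no $x_i \in U$. From $U \cap \mathrm{ur}_\Gamma(e_l) = \emptyset$, we have
$U \cap \mathrm{ur}_{\Gamma'}(e) = \emptyset$ and statements (a) and (b) follow from the
induction hypothesis.

For $y \in \mathrm{dom}\,\Gamma$ the unguarded reachable set of $y$ cannot contain
any of $x_1, \dots, x_n$, as the heap/term pair $\Gamma : e_l$ is distinctly
named, so we have $\mathrm{ur}_\Gamma(y) = \mathrm{ur}_{\Gamma'}(y)$ and (c) follows from
the induction hypothesis.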

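Case: dup x

Clearly $\Gamma_0 \subseteq (\Gamma, x \mapsto e) \subseteq (\Gamma, x \mapsto e, x' \mapsto e)$. Also,

\begin{align*}
\mathrm{ur}_{\Gamma, x \mapsto e, x' \mapsto e}(x')
 &= \mathrm{ur}_{\Gamma, x \mapsto e, x' \mapsto e}(e) \cup \{x'\} \\
 &= \mathrm{ur}_{\Gamma, x \mapsto e}(e) \cup \{x'\} \\
 &\subseteq \mathrm{ur}_{\Gamma, x \mapsto e}(\mathsf{dup}\ x) \cup \{x'\} .
\end{align*}

As $x'$ is fresh, $x' \notin U$ and from $U \cap \mathrm{ur}_{\Gamma, x \mapsto e}(\mathsf{dup}\ x) = \emptyset$ we
have $U \cap \mathrm{ur}_{\Gamma, x \mapsto e, x' \mapsto e}(x') = \emptyset$, so the first statement follows
from the induction hypothesis.

Statement (c) follows immediately as $x'$ is fresh.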

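Case: deepDup x

Let $\Gamma'$ denote the heap $\Gamma, x \mapsto e, x' \mapsto e[y'_1/y_1, \dots, y'_n/y_n],
y'_1 \mapsto \mathsf{deepDup}\ y_1, \dots, y'_n \mapsto \mathsf{deepDup}\ y_n$. Recall that, by definition,
$\mathrm{ufv}(\mathsf{deepDup}\ x) = \emptyset$, hence $\mathrm{ur}_{\Gamma'}(\mathsf{deepDup}\ x) = \emptyset$. So

\begin{align*}
\mathrm{ur}_{\Gamma'}(x')
 &= \{x'\} \cup \mathrm{ur}_{\Gamma'}(e[y'_1/y_1, \dots, y'_n/y_n]) \\
 &= \{x'\} \cup \mathrm{ufv}(e[y'_1/y_1, \dots, y'_n/y_n])
    \cup \bigcup_{z \in \mathrm{ufv}(e[y'_1/y_1, \dots, y'_n/y_n])} \mathrm{ur}_{\Gamma'}(\Gamma'\, z) \\
 &= \{x'\} \cup \{y'_1, \dots, y'_n\} \cup \bigcup_{i = 1, \dots, n} \mathrm{ur}_{\Gamma'}(\Gamma'\, y'_i) \\
 &= \{x'\} \cup \{y'_1, \dots, y'_n\} \cup \bigcup_{i = 1, \dots, n} \mathrm{ur}_{\Gamma'}(\mathsf{deepDup}\ y_i) \\
 &= \{x'\} \cup \{y'_1, \dots, y'_n\}
\end{align*}

and, as these are all fresh variables, $U \cap \mathrm{ur}_{\Gamma'}(x') = \emptyset$. Clearly,
$\Gamma_0 \subseteq (\Gamma, x \mapsto e) \subseteq \Gamma'$, so the first statement follows from the
induction hypothesis.

Statement (c) follows immediately as the additional variables
are fresh. $\blacksquare$

Figure 7. The common layout of heap objects (diagram labels: info pointer, payload, code pointer, layout info, other fields, entry code)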

Having cast our intuition of dup and deepDup into a pre- 
cise form using a formal semantics, we now explain how we 
have implemented this semantics, or rather a pragmatic ap- 
proximation, in a real environment. 

5. The prototype implementation 

Our implementation³ works with the Glasgow Haskell Compiler
(GHC), version 7.4.1, and requires no modifications to
the compiler or its runtime: The code is compiled to a usual 
object file, linked into the resulting binary and called via the 
foreign function interface. 

GHC compiles Haskell code first to a polymorphic, explic- 
itly typed lambda-calculus called Core Il4l. Ilfll , then to the 
Spineless Tagless G-machine (STG) [9]. From there, it generates
Cmm code, an implementation of the portable assembly language
C--, which is then compiled to machine code, either
directly or via LLVM.

Our work looks at objects in the sense of the STG, so we 
only need to worry about data representation on the heap [9]. 
Design decisions regarding the earlier transformations, such 
as the evaluation model [7], are thus not important here.

The common layout of all objects, or closures, on the heap
is a pointer to a statically allocated info table, followed by the
payload (Figure 7). The info table indicates the type of the object
(not to be confused with the type from the type system;
those types are completely irrelevant at this stage), contains layout
information about the payload required by the garbage collector,
namely which words are pointers to other objects and
which words are not, and the code to be run when the object
is evaluated.

There are various types of objects on the heap, most im- 
portant are: 

• Data constructors, representing fully evaluated values. The
payload consists of pointers to the parameters of the constructor.

• Function closures, representing functions. Locally defined
functions capture their free variables; these are stored in
the payload.



3 Available at http://darcs.nomeata.de/ghc-dup 
dupClosure {
    clos = UNTAG(R1);

    // Allocate space for the new closure
    (len) = foreign "C" closure_sizeW(clos "ptr") [];
    ALLOC_PRIM(WDS(len), R1_PTR, dupClosure);
    copy = Hp - WDS(len) + WDS(1);
    p = 0;

for: // Copy the info pointer and payload
    if (p < len) {
        W_[copy + WDS(p)] = W_[clos + WDS(p)];
        p = p + 1;
        goto for;
    }

    RET_P(copy);
}

Figure 8. The Cmm code for dup

• Thunks, which are unevaluated expressions. Again, the 
payload contains references to their free variables. 

• Applications of a function to a number of arguments. This 
closure type is usually only used by the GHC interpreter, 
but we use it in the implementation of deepDup. 

• Indirections, which point to another object on the heap in 
their payload. These are created during evaluation and 
removed by the garbage collector. 

When a thunk is evaluated, it is replaced by an indirection 
which points to the result of the evaluation, which can be a 
data constructor or a function closure. This way, when an- 
other reference to the thunk is evaluated, the computation is 
not repeated but the calculated result is used directly, hence 
the result is shared. The indirections do not stay around for- 
ever: The next garbage collector run, which copies all live 
data, will replace references to indirections by whatever the 
indirection points to. 

As we want to avoid this sharing, we need to prevent the
original reference from being replaced by the indirection. We cannot
change the code of the thunk, but we can copy the thunk,
thus creating a new copy that is not referenced by other code,
and then evaluate that. The essence of the surprisingly simple
code is listed in Figure 8; the closure to duplicate is passed
in the register R1 and Hp is the heap pointer which is increased
by ALLOC_PRIM.

As discussed in Section 3.2, this simple approach is not always 
sufficient, and we want a recursive variant, deepDup. 
This function, shown in Figure 9, needs to access the info 
table of the closure to figure out what part of the payload 
is a pointer to another heap object. For every referenced object, 
an application thunk is created which applies deepDup 
(or rather the variant deepDupFun with the better suited type 
a -> a), unless we are about to deepDup a deepDup thunk. In 
that case, we just copy it but leave the argument alone, reflecting 
the use of ufv(e) instead of fv(e) in the rule DEEP in 
the formal semantics. The code listing does not include a few 
shortcuts, e.g. data constructors without pointer arguments, 
such as integer values, are not copied. 
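The wrapping step can be illustrated with a small Haskell model. This is a sketch under our own assumptions rather than the Cmm implementation: heap objects are `IORef` cells, info tables are replaced by an explicit list of pointer fields, and the names `Node`, `DeepDupAp` and `pointers` are hypothetical.

```haskell
import Data.IORef (IORef, newIORef, readIORef)

-- | Toy heap objects with an explicit list of pointer fields.
data Closure
  = Node String [Ref]   -- any closure: a label plus its pointer payload
  | DeepDupAp Ref       -- application thunk: deepDup applied to a reference

type Ref = IORef Closure

-- | Copy a closure and wrap every referenced object in a deepDup
-- application thunk. A deepDup thunk itself is copied unchanged.
deepDup :: Ref -> IO Ref
deepDup ref = do
  c <- readIORef ref
  case c of
    DeepDupAp _  -> newIORef c              -- leave the argument alone
    Node name ps -> do
      ps' <- mapM (newIORef . DeepDupAp) ps -- one wrapper per pointer
      newIORef (Node name ps')

-- | Pointer fields of the object behind a reference.
pointers :: Ref -> IO [Ref]
pointers ref = do
  c <- readIORef ref
  pure $ case c of
    Node _ ps   -> ps
    DeepDupAp p -> [p]
```

The original object keeps its direct references, while the copy reaches the same objects only through the wrappers, so the duplication proceeds lazily as the copy is explored.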

5.1 Limitations of the implementation 

Our implementation is but a prototype; it does not yet work 
in all situations. One large problem is posed by statically allo- 
cated thunks: A value, say nats = [0..], defined at the module 
level is compiled to a thunk with closure type THUNK_STATIC, 
also called a constant applicative form (CAF), and receives 
special treatment by the garbage collector. Copying such a 



deepDupClosure {
  clos = UNTAG(R1);

  // Allocate space for the new closure
  (len) = foreign "C" closure_sizeW(clos "ptr") [];
  ptrs = TO_W_(%INFO_PTRS(%GET_STD_INFO(clos)));
  bytes = WDS(len) + ptrs * SIZEOF_StgAP + WDS(ptrs);
  ALLOC_PRIM(bytes, R1_PTR, dupClosure);
  copy = Hp - WDS(len) + WDS(1);
  p = 0;

for1: // Copy the info pointer and payload
  if (p < len) {
    W_[copy + WDS(p)] = W_[clos + WDS(p)];
    p = p + 1;
    goto for1;
  }

  // Do not wrap deepDup thunks again
  if (W_[copy] == stg_ap_2_upd_info &&
      W_[copy + WDS(1)] == Dup_deepDupFun_closure) {
    goto done;
  }

  p = 0;

for2: // Wrap all referenced closures in deepDup thunks
  if (p < ptrs) {
    ap = Hp - bytes + WDS(1)
         + p * SIZEOF_StgAP + WDS(p);
    W_[ap] = stg_ap_2_upd_info;
    W_[ap + WDS(1)] = Dup_deepDupFun_closure;
    W_[ap + WDS(2)] = W_[clos + WDS(p)];
    W_[copy + WDS(p)] = ap;
    p = p + 1;
    goto for2;
  }

done:
  RET_P(copy);
}



Figure 9. The Cmm code for deepDup 



closure to the heap using the code above would make the 
garbage collector abort, as it does not expect a static thunk 
to be found on the heap. But it is not possible to change the 
type of the closure, as the info table containing the type lies 
directly next to the code. And in order to create a modified 
info table somewhere else, the code would need to be copied as 
well. Therefore, dup and deepDup currently do not work 
for static thunks. When passed such a thunk, they print a 
warning and return the original reference, retaining sharing. 

It should be possible for dup to support static thunks with 
some additional information in the compiled code. Currently, 
when execution enters a static thunk and the stack and heap 
checks have been passed, the thunk is replaced by an indirec- 
tion into the heap and an update frame is pushed on the stack. 
If there were a way to jump over the code that sets up the 
indirection and update frame, e.g. via an alternative entry point 
included in the info table, dup could create a thunk on the 
heap that calls the static thunk via this route, effectively kick- 
ing off evaluation without affecting the original static thunk. 
For deepDup things are more complicated, as references to 
static objects are not part of the heap object, but are scattered 
throughout the machine code. Moving these references to the 
heap would solve the issue here at hand, but is clearly too ex- 
pensive. 

Also, the prototype does not take multithreaded pro- 
grams into account and will likely produce bad results when 
used in such an environment, e.g. when another thread re- 
places a thunk by an indirection during the thunk copy loop 



Joachim Breitner: dup - Explicit un-sharing in Haskell 






2012/7/10 



in dupClosure. Similarly, there are several specialized closure 
types (arrays, mutable references, weak pointers [8] and others 
fill, page HeapObjects]). For each of them, we would need 
to determine whether they can be safely duplicated and if so, 
whether this is actually useful. 

In the presence of Lazy IO, duplicating thunks can be out- 
right dangerous: Not only can the original and the dupli- 
cated thunk evaluate to different values but this can make 
the program crash, e.g. when one copy is done evaluating 
and causes a file to be closed, while the second copy continues 
to read from it. Generally, everything implemented with 
unsafePerformIO is prone to behave badly when combined 
with dup or deepDup. 
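The first of these hazards, diverging values, can be reproduced in a toy model. The following sketch rests on our own assumptions: thunks carry an explicit `IO` action standing in for code that uses unsafePerformIO, and `nextTicket` and `duplicatedEffects` are hypothetical names. It does not use the prototype.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef, writeIORef)

-- | Toy heap objects whose evaluation may have side effects.
data Closure
  = Thunk (IO Int)   -- unevaluated; running it may perform effects
  | Value Int        -- evaluated result

type Ref = IORef Closure

-- | Evaluate a reference and update it with the result.
eval :: Ref -> IO Int
eval ref = do
  c <- readIORef ref
  case c of
    Value v   -> pure v
    Thunk act -> do
      v <- act
      writeIORef ref (Value v)
      pure v

-- | Shallow copy of the object behind a reference.
dup :: Ref -> IO Ref
dup ref = readIORef ref >>= newIORef

-- | Hypothetical effectful computation: hands out the next ticket number.
nextTicket :: IORef Int -> IO Int
nextTicket counter = do
  modifyIORef' counter (+ 1)
  readIORef counter

-- | Evaluate a duplicated thunk, then the original (twice).
duplicatedEffects :: IO (Int, Int, Int)
duplicatedEffects = do
  counter <- newIORef 0
  t <- newIORef (Thunk (nextTicket counter))
  c <- dup t
  a  <- eval c   -- runs the effect for the copy
  b  <- eval t   -- runs the effect again for the original
  b' <- eval t   -- shared: no further effect
  pure (a, b, b')
```

In this model the copy and the original observe different ticket numbers, because the side effect is performed once per copy rather than once overall.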

Function closures need special treatment as there are cases 
where code assumes a certain reference to always be a func- 
tion closure and never a thunk that will evaluate to a func- 
tion. But this is what deepDup wants to create. Currently, 
deepDup will in this case leave the reference as it is. A so- 
lution would be to copy the function closure eagerly, so that 
the reference in the copy again points to a function closure. 
This would require more sophisticated code to detect cycles. 

6. Conclusions and further work 

While Haskell gives the programmer great devices to get 
their programs to do the right thing, such as referential 
transparency and the type system, she has fewer means to analyze 
and control their runtime behavior. Several commercial users 
have mentioned this as one of the main drawbacks of Haskell 
0, [3, Q3]. This problem deserves more attention and we 
hope that this work is one step towards a Haskell with better 
controllable and understandable time and space behavior. 

We have shown the feasibility of an explicit sharing-preventing 
operator in a lazy functional language. We provided 
two variants, dup and deepDup: the former is simpler 
but possibly more subtle to put to use effectively; the latter 
works more predictably but may impose a larger performance 
penalty. This is, on a prototypical level, possible with 
an unmodified Haskell compiler. 

As described in Section 5.1, there is work to be done on 
the implementation before it can be used in production code. 
Some of that might require changes to the compiler code. 
Given how sensitive the code is to changes in the runtime 
representation of Haskell values, a production version of dup 
would probably have to be shipped along with the compiler. 